Accelerating the Fourier split operator method via graphics processing units 
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Current generations of graphics processing units have turned into highly parallel devices with general computing 
capabilities. Thus, graphics processing units may be utilized, for example, to solve time dependent partial 
differential equations by the Fourier split operator method. In this contribution, we demonstrate that graphics 
processing units are capable to calculate fast Fourier transforms much more efficiently than traditional central 
processing units. Thus, graphics processing units render efficient implementations of the Fourier split operator 
method possible. Performance gains of more than an order of magnitude as compared to implementations for 
traditional central processing units are reached in the solution of the time dependent Schrodinger equation and 
the time dependent Dirac equation. 
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I. Introduction 

For several decades, users and developers of high perfor- 
mance computing applications could trust on Moore's law. 
The number of transistors that can be placed inexpensively 
on an integrated circuit has doubled approximately every two 
years. This exponential growth provided the basis for an 
ever-increasing computing power of modern central process- 
ing units (CPUs). Thus, high performance computing appli- 
cations became faster by just trusting on Moore's law and 
waiting for faster CPUs. Thanks to Moore's law, today's 
desktop computers supply the computational power of the last 
decade's super-computers. 

Transistor counts still continue to grow exponentially. 
Hardware manufacturers, however, no longer make single 
computing units more complex and faster. Clock rates and 
execution optimizations have reached their limits. Now hard- 
ware manufacturers use the plethora of transistors to put many 
computing units on a single chip manufacturing highly paral- 
lel computing architectures. There are two major directions 
of development, multicore architectures with a few (typically 
two to a few dozen in the near future) general purpose comput- 
ing units and dedicated accelerators with several hundred or 
more computing units with reduced capabilities each as com- 
pared to traditional CPUs [1, 2]. These accelerators may be 
implemented in field-programmable gate arrays or may be cus- 
tom built systems as the Cell Broadband Engine Architecture 
or the ClearSpeed™ CSX700 Processor, for example. Graph- 
ics processing units (GPUs) with general purpose computing 
capabilities brought accelerated computing to the mass mar- 
ket. Employing GPUs to perform computations that are tra- 
ditionally handled by CPUs is commonly referred to as GPU 
computing [3]. 

Parallel architectures are not new in the high performance 
computing community. In the past, many high performance 
computing applications employed various kinds of parallel 
computing architectures which had been the high-end com- 



puting systems of their time. Today, however, even low-end 
consumer computers are parallel systems. The new emerg- 
ing ubiquitous parallel architectures indicate a silent paradigm 
change in computing or as Herb Sutter [4] pointed out "The 
free lunch is over." In order to exploit the power of modern 
hardware architectures it is required to write parallel applica- 
tions. The term "computing" becomes synonymous to "paral- 
lel computing." 

The purpose of this contribution is to evaluate the Fermi 
GPU computing architecture and to demonstrate how an im- 
plementation of the Fourier split operator method on GPUs 
may boost the performance of the numerical propagation of 
time dependent partial differential equations. The remainder 
of this paper is organized as follows. In section II we give a 
generic description of the Fourier split operator method and 
show how to apply it to the time dependent Schrodinger equa- 
tion and the time dependent Dirac equation. Section III gives 
a very short introduction to the CUDA architecture and de- 
scribes our GPU implementation of the Fourier split operator 
method while section IV presents performance comparisons. 
In section V we illustrate some applications of the Fourier 
split operator method. 



II. The Fourier split operator method 

Let us consider the Cauchy type initial value problem problem 



dw(x, t) 
dt 



A(t)w(x, t) 



(la) 



with some possibly time dependent differential operator A(t) 
acting on the coordinates x = (xi,X2, ■ ■ ■ ,x n ) and with the 
initial condition 



w(x, 0) = vfo(jc) . 



(lb) 
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The function w(x, t) depends on x and the time coordinate t 
and may be real or complex and possibly vector valued. The 
Cauchy problem (1) includes, for example, the diffusion equa- 
tion, the Focker-Planck equation [5], the current-free Maxwell 
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equations [6], the Schrtidinger equation, the Pauli equation, 
the Dirac equation [7], as well as the Klein-Gordon equa- 
tion in the Feshbach-Villars representation [8] or the Black- 
Scholes options pricing equation. 

Using Dyson's time ordering operator T, the formal solu- 
tion of (1) is given by 

w(x, t) = U(t, 0)w(x, 0) (2) 

with the time-evolution operator [5, 9, 10] 

U(t 2 ,h) = fexp^ 1 A(t')dt'j (3) 

that effects the function's evolution from time t\ to time t 2 . 
For some highly symmetric operators A(f) in (1), analytic ex- 
pressions of the time-evolution operator (3) can be calculated. 
However, investigations of many relevant problems require 
numerical methods, for example, the Fourier split operator 
method, which we will shortly describe in the following para- 
graphs. 

11.1 . General outline of the Fourier split 
operator method 

Fleck et al. [11] introduced the Fourier split operator method 
as an explicit time stepping scheme for the solution of the time 
dependent scalar Maxwell wave equation in Fresnel approx- 
imation. The method is applicable to Cauchy problems (1) 
with a differential operator A(t) that has the property that it 
may be separated into a sum of two operators 

A(t)=A 1 (t)+A 2 (t) (4) 

such that A\(f) can be easily diagonalized in real space while 
A2(t) can be easily diagonalized in Fourier space; a require- 
ment that is fulfilled by many partial differential equations 
of relevance. Thus, shortly after Feit et al. [12] solved the 
Schrtidinger equation with a time independent Hamiltonian 
numerically by the Fourier split operator method, it became 
a standard tool for the propagation of quantum mechanical 
wave equations. Later, the method had been generalized to the 
Schrtidinger equation with a time dependent Hamiltonian [13] 
and it was applied to other equations, for example, the Dirac 
equation [14—17], the time dependent Gross-Pitaevskii equa- 
tion [18], the non-linear Schrtidinger equation [19], and the 
time dependent Maxwell equations for electromagnetic waves 
in random dielectric media [6]. 

The central idea of the Fourier split operator method is to 
approximate the operator (3) by a product of operators that are 
diagonal either in real space or in momentum space. Let O(t) 
denote some possibly time dependent operator and define the 
operator 

6 (t 2 ,h,#) = exp{s J* <5<r)d?'j , (5) 

that depends on the times t\ and t 2 and the auxiliary param- 
eter 6 and let U g(t 2 , h, S) denote the operator (5) in Fourier 



space. Expanding U (t + r, f) to the third order in t and 
provided that the operator A(t) has the form (4), the time- 
evolution operator (3) can be factorized [9] into 

U(t + r,f) = ex P^ A(f')df'j + 0(T 3 ) = 
U A] (t + r, t, i) (J M (t + t, t, 1) U Ai (t + t, t, i) + O (r 3 ) . (6) 

Neglecting terms of order O (r 3 ), equation (6) gives an explicit 
second order accurate time-stepping scheme for the propaga- 
tion of the function w(x, t) 

w(x, t + t) = 

U Ai (t + t, t, i) A2 (t + r, f, 1) U Ai (t + t, t, i) w(x, t)+0 (r 3 ) . 

(7) 

This scheme translates the difficulty of calculating the action 
of operator (3) to the task of calculating the action of (5) for 
O = A\ and O = A 2 , respectively. Strang [20] utilized the 
splitting (7) to calculate the action of (5) in real space by a 
finite differences scheme. For many Cauchy problems (1), 
however, one can find a splitting (4) such that the operator 
U A (t + t, t, 6) is diagonal in real space and U Al (t + r, t, 6) is 
diagonal in Fourier space. Thus, the calculation of these op- 
erators becomes feasible in the appropriate space and (7) is 
calculated via 

w(x, t + t) = 

u Ai (t + t, ?, i) r- x [ti A2 (t + T,t,\)T [u Ai (t + r, t , i) w(x, <■)}} 

+ 0(t 3 ). (8) 

The expression T {■} in (8) denotes the Fourier transform of 
the argument and T~ x {•} the inverse Fourier transform. 

In a computer implementation of the Fourier split operator 
method, the function w(x, t) is discretized on a rectangular reg- 
ular lattice of points and the continuous Fourier transform 
is approximated by a discrete Fourier transform. The compu- 
tational complexity of propagating the function w{x, t) from 
time t to time t + r is dominated by the transformation into 
Fourier space and back into real space. If these transforms are 
accomplished by the fast Fourier transform the computation of 
an elementary step of the Fourier split operator method takes 
0(N log N) operations. 



11.2. Fourier split operator method for the 
Schrodinger equation 

The Schrtidinger equation is an equation of motion for a 
complex-valued scalar wave function ^(x, t) evolving in a 
<f -dimensional space x and in time t. In its most general form, 
it describes a non-relativistic spin zero particle of mass m and 
charge q in the electromagnetic potentials <p(x, f) and A(x, t). 
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The Schrtidinger equation is invariant under Galilean transfor- 
mation and reads 



dVjx, t) 
dt 



x,t)\V(x,t). (9) 



— (-nN-qA{x,t)f + l 
2m 

In order to apply the Fourier split operator method for prop- 
agating ^(x, t) under the effect of (9), we have to restrict the 
vector potential to homogeneous fields A(x, t) = A{t); an ap- 
proximation that is known as the dipole approximation. Split- 
ting the Hamiltonian (9) in dipole approximation into a poten- 
tial energy term and a kinetic energy term 



inA] 



Cm), 



iM 2 = — (-fiiV - qA(t)f 
2m 



(10a) 
(10b) 



separates the spatial dependent parts from spatial derivatives, 
which makes the operator (t + t, t, 6) diagonal in real 
space and U A (t + t, t, 8) diagonal in momentum space, re- 
spectively. The action of (t + t, t, 8) on a real space wave 
function is given by 



Us (t + r, t, 8) t) = exp 



Br 



qcf>(x,t')dt'j x i'(x,t) 
(ID 

and the action of the Fourier space operator (t + t, t, 8) to 
a wave function in Fourier space 

< ¥(p,t)=r{W(x,t)} = 

J «p(-fr-x/»)d'* (12) 

reads 



U Ai {t + T,t,S)W{p,t) = 



4 rj_ 

n J, 2m 



(p-qA(t')f dt'\W(p,t). (13) 



exp -8 



Note that it is crucial for the application of the Fourier split 
operator method that the vector potential A(t) does not depend 
on the spatial coordinate x. The expansion of the Hamiltonian 
of the Schrtidinger equation (9) for a particle in an arbitrary 
vector potential A{x,t) contains the term (iqn/m)A(x,t) ■ V, 
that is spatially dependent and contains spatial derivatives, 
too, coupling momentum and coordinate space. Coupling be- 
tween momentum and coordinate space is absent for vector 
potentials in dipole approximation but also, for example, for 
the vector potential of a linearly polarized plane wave with 
A(x,f) = (AiCc 3 , 0.0,0) [17]. 



11.3. Fourier split operator method for the 
Dirac equation 

In contrast to the Schrtidinger equation, the Dirac equation [7] 
describes a relativistic spin half particle and it is invariant un- 
der Lorentz transformation. A Dirac wave function ^(x, t) is a 



two component (for one-dimensional systems) or four compo- 
nent (for two- or three-dimensional systems) complex-valued 
vector function. The Dirac equation for a particle of mass m 
and charge q moving in the electromagnetic potentials (p{x, t) 
and A(x,t) is given by 



m 



dt 

d 



( 

C 



-in- qAj(x, t) + q(f)(x, t) + mc"B 

on 



y(x, t) 
(14) 



with the matrices a-, and f3 and the speed of light c in vacuum 
and Aj(x, t) denoting the ;th component of the vector potential 
A(x, t). The matrices a/, ft obey the algebra 

af = /3 2 = 1 , a,a k + a k aj = 28^ , aft + /far,- = . (15) 

This algebra determines the matrices a-, and (3 only up to uni- 
tary transforms. In numerical applications, we adopted the 
so-called Dirac representation with 



ff3 



(0 1 
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10 

U o o o) 

1 0^ 

-1 
10 

to -1 



U2 



(0 -it 
i 

-i 

U o o o, 

1 0^ 
10 
0-10 

to -1 



(16) 



For one-dimensional systems, the Dirac representation re- 
duces to 



1 

1 



p 



1 
-1 



(17) 



In order to apply the Fourier split operator method to the 
Dirac equation (14), we split the Hamiltonian into an interac- 
tion part and a free particle part 



U 

inAi = c^ aii (-qAi(x, t)) + qc/)(x, t) , 

;=1 

inki — c a,- (— iti— ) + mc 2 p 



(18a) 
(18b) 



and calculate (t + r, t, 8) and U Al (t + r, t, 8) in the follow- 
ing way. 

The operator Ujt (f + t, t, 8) may be determined by splitting 
A[ further into 



A, = Ai i + A 



1,2 



with 



iMi_i = q(p(x, t) , 

d 

iMi, 2 = a-, (-qAi(x, t)) . 



(19) 

(20a) 
(20b) 
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Because Ai ; i and Ai,2 commute we may factorize the operator 
U Ai (t + r,t,6) into 

U Ai (f + t, t, 6) = U Au (t + t, t, 6) U Ai 2 (t + r, t, 6) (21) 
with the diagonal operator 



U Aii (t + r, t, 6) y(x, f) = exp 



i f 



q<p(x, t')At'^{x,t). 

(22) 

The operator U A , (f + t, t, 6), however, involves matrix expo- 
nentials, which may be calculated by taking into account the 
Dirac algebra (15). Exponentials of a, are obtained by sum- 
ming the exponential-function's Taylor sum explicitly. Intro- 
ducing some auxiliary complex numbers a, we find 



exp 



V i=i J k=o ' v i=i 



( d \ 



( d \ 



i> i*=i ) 

d 



3 (— 1)* 

(2k)\ f-f ) j-( (2Jfc+l)! 



/ d ^ 
Vi'=l 



Z a,a; 



2A- 



z 

/.=() 



(-1)* 

(2k)\ 



,2A 



1 i=l 

V 



Z«? 



«Z»«Z(=^ 



.2* 



* i=i 







J } 


d 




-1/2 


f 


d ) 


cos 






+ i 2] fl/O'i 




sin 








\ 


1=1 


i=l 


.1=1 J 




\ 


1=1 



(23) 



where we have used 





2A 


/ d 


Z a ' a ' 




z 


,1=1 J 




Vi=i 


( d 

z^ 

V i=l 


d i-1 

+zz 

i=l 7=1 



2x* 



,2/, 



For convenience, let us define 



Aj(x, t) 



r 



Ai(x,t')dt' 



(24) 



(25a) 



and 



A(x, f) = 



. ^A,(*,f) 2 , 

) i=l 



(25b) 



then we get with (23) 

U Ai , (f + r, t, S) W(x, t) 

J Ai(x,t) 



cos 



¥(x, f) . 



The operator U A fe, h, 5) equals the time evolution opera- 
tor of the free particle Dirac Hamiltonian. In Fourier space it 
has the form 



U A (t + t, t, 6) = exp 



-<4 



l d 

C 

K 1=1 



^ aiiPi + mc 2 /3 



(27) 



where p, denotes the ;th component of the momentum vector 
p. In order to calculate the operator exponential in (27) we 
have to diagonalize the operator 



i=i 



H1A2 — c ^ crip; + mc 2 f3 
by introducing the scalars 

d±(p) = 



1 

— + 



1 



2 2^Jl+p 2 /(mc) 2 , 



1/2 



the unitary matrix 



u(p) = d+(p) + d-(p) T-fi ■ at ■ 

!=1 'P' 



(28) 



(29) 



(30) 



and its Hermitian adjoint u~(p). The matrix u(p)ifiA2U T (p) is 
diagonal [21] and it reads for two- or three-dimensional sys- 
tems 



u(p)ihA2U (p) 



(E(p) 

E(p) 

-E(p) 

1 -E(p)) 



with 



E(p) - yjm 2 c 4 + p 2 c 2 . 



(31) 



(32) 



Thus, the Fourier space operator (27) simplifies to 
D A (t + t, t , 6) ¥(p, t) = 



uipy 



e -iE(p) T /t, 000 

e - iE( p )T/n 

e i£( ^ )T/fi 











e 



iE(p)T/H 



u{ P m P ,f). (33) 



For the one-dimensional Dirac equation where the vector 
potential reduces to a scalar A \ (x, t) and the wave function 
> P(x, t) has only two components the operators U A (t + t, t, 6) 

and U A (t + t, t, 5) have to form 

U Ai2 (t + T,t,5)y(x, t) = 

(cos(-yAi(;t,f)) + iai smi-^Aiixj)))^^) (34) 

and 

ti* (t + T,t,$)¥(p,t) = 



(26) 



AE(p)r/h Q 







u{ P m, P ,t). (35) 
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III. GPU implementations of the 
Fourier split operator method 

111.1 - GPU computing 

As we have outlined in Section II, the Fourier split operator 
method consists of three elementary steps, the application of 
the two operators and Ufi plus the calculation of the fast 
Fourier transform and its inverse. All these three steps may 
be parallelized. The application of t/| or is embarrass- 
ingly parallel — each grid point can be updated independently 
from others — and also the fast Fourier transform may be par- 
allelized efficiently [22]. 

Scalable parallel performance, however, is difficult to at- 
tain on traditional CPU systems due to the imbalance of high 
CPU speed and relatively slow memory. The Fourier split op- 
erator method in particular is inherently memory bandwidth 
bounded because the application of or requires only 
O(N) operations and the calculation of the Fourier transform 
takes 0(N log N), where N denotes the number of grid points. 
Its low computational complexity is a very beneficial feature 
of the Fourier split operator method but it also means that 
there is no or little potential data reuse making the memory 
bandwidth a limiting factor. 

GPU computing [2, 3] may help to overcome these limita- 
tions. Due to its massively parallel architecture, GPUs reach 
a peak performance in floating point number operations that 
is more than one order of magnitude higher than the peak per- 
formance of current CPUs. GPUs also provide higher maxi- 
mal memory bandwidth. Intel's Core™ 7 CPUs, for example, 
reach up to 25.6 GB/s memory bandwidth while NVIDIA's 
Tesla™ M2050 and M2070 computing processors — which 
are based on GPU technology — offer up to 148 GB/s mem- 
ory bandwidth, see also Table 1. Modern GPUs attain their 
impressive computational performance figures without signif- 
icantly exceeding the power consumption of high-end CPUs. 
In fact, GPUs and other accelerators provide the best floating 
point performance per watt of current high performance com- 
puting architectures. 



III. 2. CUDA parallel architecture and 
programming model 

GPU computing is enabled by programming models that pro- 
vide a set of abstractions that enable to express data paral- 
lelism and task parallelism. These programming models are 
typically implemented by equipping a sequential general pur- 
pose programming language, as for example C or Fortran, 
with extensions for parallel programming and providing an 
application programming interface. OpenCL [23], Microsoft 
DirectCompute and CUDA by NVIDIA are the most popu- 
lar programming models for GPU computing. For our imple- 
mentation of the Fourier split operator method we chose the 
CUDA programming model (version 3.2) which works only 



TABLE 1 : Comparison of some technical key features of the high- 
end consumer graphics card NVIDIA GeForce GTX 480 and the 
computing processor module Tesla M2050 (source: NVIDIA). 

GeForce GTX 480 Tesla M2050 



Processor cores 
Processor core clock 
Memory 
Memory clock 
Memory bandwidth 
Power consumption 



480 
1.40 GHz 
1.5 GB 
1.848 GHz 
177 GB/s 
« 250 W 



448 
1.15 GHz 
3 GB 
1.546 GHz 
148 GB/s 
< 225 W 



for graphics hardware by NVIDIA but provides more flexibil- 
ity than OpenCL or DirectCompute. OpenCL is the platform 
independent but has no support for C++ and DirectCompute 
works only on Microsoft Windows systems. 

The term CUDA also refers to some GPU architectures by 
the hardware manufacturer NVIDIA. The latest CUDA GPU 
architecture is called Fermi. It features up to 512 computing 
cores which are organized in 16 streaming multiprocessors of 
32 cores each. Each of the 32 cores of a streaming multipro- 
cessor executes the same instruction on different data sets at 
the same time. Therefore, GPUs belong to the class Single In- 
struction, Multiple Data streams (SIMD) in Flynn's taxonomy 
[24]. 

GPU accelerator cards come in two different flavors, tradi- 
tional graphics cards, as the NVIDIA GeForce GTX 480, and 
dedicated computing processors, as the NVIDIA Tesla M2050. 
Table 1 compares some technical key features of these two 
cards. Both are based on the so-called Fermi architecture but 
differ, for example, in memory and clock rates. According to 
this table, the dedicated computing processor M2050 seems to 
be inferior to the GTX 480. However there are two important 
features that qualify the M2050 for high performance comput- 
ing applications. It has larger memory with error correction 
and full double precision performance. In consumer Fermi 
GPUs as the GTX 480, the number of double precision float- 
ing point operations that may be carried out per clock cycle 
is reduced by a factor of four as compared to dedicated GPU 
computing processors based on the Fermi architecture. 

A GPU has its own memory. This means, before one can 
run a GPU computation the input data has to be transferred 
from the computer's main memory (also called host memory) 
to the GPU device memory. GPU computations are carried 
out in device memory and final results will be transferred back 
into host memory. Memory transfer is a relatively slow op- 
eration. Therefore, one should reduce the number of mem- 
ory transfers to reach high performance. The CUDA archi- 
tecture provides three different kinds of host memory, non- 
pinned, pinned, and mapped memory. Memory transfers with 
pinned memory are faster than memory transfers with non- 
pinned memory. However, allocating pinned memory is slow. 
Mapped memory allows a GPU to work directly on data in 
host memory. Thus, with mapped memory there is no need 
to copy data but a GPU will access data in mapped memory 
not as fast as in device memory. We will study the impact of 
memory transfer in section IV. 1 . 
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FIG. 1 : Wall clock time to compute a one-dimensional (left panel) and two-dimensional (right panel) fast Fourier transform of complex-valued 
double precision arrays of size N and N x N, respectively. Computations are done in place, that is, the input data is overwritten by the output. 
Times T C p\j and T p V denote time for computation utilizing a single CPU core (Intel Core i7 CPU at 2.93 GHz) and the FFTW3 library and a 
GeForce GTX 480 GPU and the CUFFT library, respectively. In the GPU case, measurements were carried out including and excluding data 
transfer between host memory (nonpinned or pinned) and GPU memory. Mapped memory resides on the host eliminating the need for data 
transfer. 



III.3. Implementation 

We have developed highly tuned codes for the Fourier split 
operator method that allow us to propagate Schrtidinger wave 
functions and Dirac wave functions in one and two dimen- 
sions. For each equation and each dimension we implemented 
a conventional non-parallel CPU code as well as a CUDA 
code that performs all computations on a GPU. To carry out 
the fast Fourier transform CPU codes utilize the FFTW3 li- 
brary [25]. The GPU codes employ the CUFFT library for 
fast Fourier transforms and comprise light-wight kernels that 
accomplish the action of the operators and for each 
grid point in parallel. 



IV. Performance results 
IV.1. Fast Fourier transform 

The overall performance of the Fourier split operator method 
is mainly determined by the performance of the fast Fourier 
transform. Thus, we compare in Figure 1 the wall clock time 
7cpu that is required to calculate the fast Fourier transform on 
a CPU with the time Tgpu to perform the same task on a GPU. 
For not too small problems the GPU outperforms the CPU 
significantly. With our hardware setup (Intel Core i7 CPU at 
2.93 GHz with a GeForce GTX 480 GPU) we got a speedup 
up to a factor of about 30. For small problems, however, the 
overhead of starting and coordinating parallel CUDA threads 
prevents to archive good fast Fourier transform performance 
on GPUs. There is a critical problem size where the GPU 



starts to outperform a CPU in calculating a fast Fourier trans- 
form, which is for our hardware setup at data sets of about 2 12 
complex numbers. 

For GPU computations, one may use nonpinned, pinned, or 
mapped memory on the host. If one includes the time to copy 
data from host memory to device memory and back again after 
the calculation of the Fourier transform in the measurements 
then the time depends on the kind of memory. Pinned mem- 
ory is faster than nonpinned memory. Mapped memory elim- 
inates the need to copy data between host and device, how- 
ever, performing the fast Fourier transform on device memory 
plus copying is usually faster than calculating it in mapped 
host memory without data transfer. Only for one-dimensional 
Fourier transforms of intermediate size carrying out the fast 
Fourier transform in mapped host memory requires less time 
than performing the same task on device memory plus data 
transfer or using a (nonparallel) CPU implementation. 

We took also fast Fourier transform performance measure- 
ments for a Tesla M2050 computing processor. Despite the 
fact that the M2050 can perform four times more double pre- 
cision operations per clock cycle than the consumer GPU 
GeForce GTX 480, the Fourier transform performance of the 
Tesla M2050 is slightly lower than for the GeForce GTX 480. 
For a two-dimensional Fourier transform of a 4096 x 4096 ar- 
ray, for example, the Tesla M2050 reached only about 95 % of 
the GeForce GTX 480 GPU performance. For smaller arrays 
the performance difference was even larger. This may be inter- 
preted as that the fast Fourier transform performance is bound 
by the GPU's memory bandwidth rather than by its floating 
point performance. 

Our findings for the fast Fourier transform performance are 
consistent with the results presented in the literature. Refer- 
ence [26] reports a performance gain up to a factor of about 
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FIG. 2: Speedup Tcpu/^gpu f° r propagating a wave function over 128 time steps as a function of the grid size. The left panel depicts results 
for the one-dimensional Schrodinger equation of a grid of N data points while the right panel shows results for the two-dimensional Dirac 
equation of a grid of N x N data points. Performance measurements carried out on the same hardware as in Figure 1 and in double precision. 



1 1 in calculating two-dimensional single precision fast Fourier 
transforms on a NVIDIA GeForce 9800 GX2 GPU[27] in- 
stead on a conventional CPU. In [22] a 8- to 40-fold improve- 
ment was archived over highly tuned CPU routines by using 
an algorithm that efficiently exploits GPU shared memory. 

To sum up, the CUFFT GPU implementation of the fast 
Fourier transform is capable to outperform traditional CPU 
implementations by more than one order of magnitude. Data 
transfer between host memory and device memory poses a 
non-negligible overhead and should be avoided if possible or 
reduced by using pinned memory. 



IV.2. Fourier split operator method 

In order to determine the speedup that may be attained by 
switching from a CPU implementation to a GPU implementa- 
tion of the Fourier split operator method, we propagated one- 
and two-dimensional Gaussian wave packets in a harmonic po- 
tential over 128 time steps under the effect of the Schrodinger 
equation (9) and the Dirac equation (14), receptively. We mea- 
sured 

• the overall time to propagate 128 time steps (In the case 
of GPU computations, this includes data transfer between 
host memory and device memory before the first step and 
after the last step.), 

• the time to perform the fast Fourier transform plus its in- 
verse, 

• the time to apply the operator 0% in position space, and 

• the time to apply the operator in Fourier space 

for various grids of size N and N x N. Figure 2 shows 
the speedup Tcpu/Tgpu for these four different tasks for 
the one-dimensional Schrodinger equation and for the two- 
dimensional Dirac equation as a function of the grid size. If 



one considers the individual steps of the Fourier split oper- 
ator method, then one finds that the speedup for the appli- 
cation of U£ and 0% reaches between 50 to about 100 for 
large systems (more than about 2 18 grid points). For the fast 
Fourier transform, however, we get a speedup of only about 
30 limiting the overall speedup to about 40 for large systems, 
which is still a significant improvement over the CPU imple- 
mentation. Results for the two-dimensional Schrodinger equa- 
tion and the one-dimensional Dirac equation are not shown 
in Figure 2 because they are qualitatively very similar to 
the two-dimensional Dirac equation and the one-dimensional 
Schrodinger equation, respectively. 



V. Applications 

In this section we will show some applications of our Fourier 
split operator GPU codes for the solution of the time depen- 
dent Dirac equation. With conventional CPU codes [16], these 
kinds of applications would require more than an order of 
magnitude more computing time or may not be performed in 
an admissible amount of time. For our numerical simulation 
we adopt the atomic unit system (a. u.) that is established on 
the Bohr radius, the electron's mass, the reduced Planck con- 
stant and the absolute value of the electron's charge as its base 
units. 



V.1 . Evolution of a free Gaussian wave 
packet 

A flf-dimensional wave packet may be formed by superimpos- 
ing plane waves solutions of the Dirac equation with defined 
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Dirac wave packet at time t — 0.0125 a. u. is illustrated in 
FIG. 3. The position of the initial probability density's max- 
imum corresponds to the initial center of mass at the center 
of the coordinate system. The wave packet's center of mass 
moves in accordance with classical predictions with velocity 
\p\/(m yjl + p 2 /(mc) 2 ) w 38 a.u. along the jt-axis. Due to the 
finite speed of light [28], however, shock fronts emerge travel- 
ing approximately with the speed of light into the forward and 
backward directions. The maxima of the probability density 
no longer coincide with the center of mass. See [29, 30] for 
an investigation of similar relativistic effects. 



2 



FIG. 3: Probability density of a Dirac wave packed with narrow 
asymmetric momentum distribution at time t = 0.0125 a.u., see text 
for specific parameters. Due to the finite speed of light, the wave 
packet splits into two shock fronts traveling into opposite directions. 



momentum p, viz 

Tracked (-*0 



n 



(36) 



The quantity p(p) determines the momentum distribution. For 
a two-dimensional Gaussian distribution with mean momen- 
tum p and momentum widths cr j in forwards direction and cr s 
in sidewards direction the function p(p) reads 



PGaussianQ?) = «0) 



1 



(2^) 1 / 2 |i:| 1 / 4 



exp 



(p-p) T Z \p-p) 



(37) 

where u{p) denotes a column of the matrix u(p) (30) (selecting 
one the four momentum eigenstates with momentum p) and 
with the matrix 



and 



R 



cos(0) - sin(0) 
,sin(</>) cos(0) t 



1/C7*J 



and 



R~ 



tan< 



(38) 



P - ■ (39) 

Px 



The evolution of the probability density of a relativis- 
tic Dirac wave packet differs significantly from the non- 
relativistic theory of the Schrodinger equation, where a Gaus- 
sian wave packet broadens but remains its Gaussian shape for 
all times and the maximum of the probability density trav- 
els with the same speed as the motion of the center of mass 
does. We consider the propagation of a relativistic Gaussian 
Dirac wave packet in two dimensions with asymmetric mo- 
mentum widths o-f = 200 a. u. and <r s = 20 a. u., a mean 
momentum p = (40a.u.,0a. u.), positive energy and spin 
up. This corresponds to a rather narrow initial wave packet. 
The probability density of this two-dimensional relativistic 



V.2. Eigenstates 

Propagating a trail wave function ^(x, f) in a time independent 
potential from t — to some later time t — T allows us to 
determine the potential's eigenenergies [12] by calculating the 
autocorrelation function 



x(t) 



j y F(x,0)* x ¥(x,t)d d x. (40) 



The Fourier transform of the autocorrelation function 



X(E) 



X(t)(l - cos(2;rf/r)) exp(if£/ft) dt (41) 



exhibits pronounced peaks at the eigenenergies provided that 
the initial trial function is not orthogonal to some eigenfunc- 
tion. Once we have determined an eigenenergy E from the 
spectrum x(E), the potential's eigenfunction T'gOt) (assum- 
ing there is no degeneracy) with eigenenergy E results from 
the integral 



Jo 



*¥(x, - cos(27rf / r)) exp(i£f /K) dt . (42) 



For one- or two-dimensional model systems it is common 
practice to mimic the Coulomb potential 



V c (x) = -ecf>(x) = - 



4tt£oM 



by a soft-core potential 



V sc (x) = -e<f>(x) = - 



Ze 1 



4ne V* 2 + tf/Z) 2 



(43) 



(44) 



In (43) and (44) e denotes the elementary charge, Z the atomic 
number, eo the vacuum permittivity, and f the soft-core param- 
eter. 

If one replaces the three-dimensional Coulomb potential 
(43) by a lower dimensional soft-core potential (44), it is often 
appropriate to choose the soft-core parameter £ such that both 
potential share the same groundstate energy. In Schrodinger 
theory, the Coulomb potential (43) has a groundstate energy of 
-mc 2 (aZ) 2 /2, where we have introduced the electron mass m 
and the fine structure constant a = e 2 /{AjteqTic). Numerically 
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FIG. 4: Relativistic corrections to the ground state energy Eq for 
the two-dimensional soft-core potential (44) with £ = 0.791 and 
the three-dimensional soft-core potential for different values of the 
atomic number Z. In the non-relativistic Schrodinger theory, both 
potentials share approximately the same ground state energy of 
-mc 2 (aZ) 2 /2. 



we find that for ( « 1.4133 the one-dimensional soft-core po- 
tential has approximately the same groundstate energy as the 
three-dimensional Coulomb potential with the same atomic 
number. In two dimensions, one has to set £ ~ 0.791 in order 
to mach the groundstate energies. These statements hold for 
all Z. 

In Dirac's quantum theory, the groundstate energy of the 
three-dimensional Coulomb potential (43) is mc Vl - (aZ) 2 
and the soft-core parameter that matches the groundstate en- 
ergies depends on Z, however, it is close to the values for 
the Schrodinger case. The Dirac groundstate energy of the 
Coulomb potential equals the Schrodinger groundstate energy 
plus the rest mass energy mc 2 and relativistic corrections pro- 
portional in Z 4 , viz. 



c 2 s/l - (aZ) 2 — mc 2 - mc 2 



(aZ) 2 



,(aZ) 4 



(45) 



We determined the Dirac groundstate energy Eq of the two- 
dimensional soft-core potential (44) with (, = 0.791 as a func- 
tion of the atomic number Z with high numerical accuracy 
and found that relativistic corrections to the two-dimensional 
soft-core-potential's groundstate energy are weaker than for 
the three-dimensional Coulomb potential as shown in Fig. 4. 



V.3. Free wave packed scatting at a light 
pulse 

At high laser intensities, charged particles in electromagnetic 
waves are not only affected by the electric field component but 
also by the magnetic field component. At intensities larger 
than about 55mhur /{cLi^q 2 ), the Lorentz force becomes rel- 
evant and at intensities above 0.lm 2 c 2 u> 2 /q 2 also relativistic 
effects have to be taken into account [31], here to denotes the 
laser's angular frequency and /iq the vacuum permeability. 



We simulated the two-dimensional motion of a free wave 
packed in a laser pulse with the electromagnetic fields 

E(x, t) = E sin(£ ■ x - cot)fjj(k ■ x - tot) , (46a) 

B{x, t) = — sin(k ■ x - a>t)fjj(k ■ x - cot) (46b) 



with an envelope function fjj(r]) of j half cycles with a linear 
turn-on ramp and a linear turn-off ramp of / half cycles and a 
constant plateau in between, viz. 



fjM 



(7] + jn)l{ln) if -j < Tj/jr < -j + I, 

1 if -j + I < J]/n < -I, 

-T)/(ln) if-l<t]/n<0, 

else 



(46c) 



and \k\ = 2n/A = co/c and Eq ± k. Figure 5 shows the 
wave-function's probability density at different times for a free 
wave packed scatting at a strong laser pulse traveling along the 
x-direction and having an overall length of eight half cycles 
and turn-on/off ramps of two half cycles. Its field strength is 
\Eq\ = 3000 a. u. and its wave length A — 40 a. u. The electric 
field component accelerates the wave packet back and forth 
along the y-direction, while the Lorenz force causes a drift 
into the x-direction. 



V.4. Ionization 

In our fourth example, we consider ionization in ultra-strong 
and ultra-short laser pulses. The Dirac ground state wave 
packet of a soft-core potential (44) with Z = 32 and £ = 0.791 
is excited by an external laser pulse (46) with wavelength 
A = 10 a. u. and peak electric field strength |Z?ol = 3072 a. u.. 
We consider a short laser pulse of four half-cycles including a 
tun-on and a turn-off half-cycle. Figure 6 shows the electron 
density after the laser pulse has passed the atomic core. In 
the setup, we have chosen here, the laser travels from left to 
right and the electric field points into the y-direction, acceler- 
ating the wave packet into the positive y-direction in the first 
and third half-cycles and into the negative y-direction during 
the second and fourth half-cycles. The Lorenz force causes an 
acceleration into the laser's propagation direction. The strong 
interference patters in Fig. 6 form by the interaction of wave- 
function's different components traveling into different direc- 
tions. 

Note the huge wave function spreading during the ioniza- 
tion. While the ground state wave function has a width of 
about 1/32 a. u., the wave function after ionization spreads 
over several atomic units as shown in Fig. 6. Wave function 
spreading poses a major challenge in the numerical simulation 
of light matter interaction at high intensities limiting the sim- 
ulations to very short time scales where the size of the wave 
function remains of manageable size. The simulation of the 
ionization process of Fig. 6 to only about ten minutes using 
our GPU Fourier split operator implementation. Thus, the sim- 
ulation of even longer interactions with laser pulses of longer 
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FIG. 5: Free wave packed scatting at a strong laser pulse. The false color plots show the wave-packet's probability density at different points 
in time t. The solid gray line indicates the center of mass trajectory. The laser pulse travels from left to right. See text for detailed parameters. 
Note that the computational grid follows the center of mass motion. 




x/&. u. 



FIG. 6: A wave-function's probability density after ionization from 
the ground state of a soft-core potential (44) by a ultra-strong laser 
pulse, see text for details. 



VI. Conclusions and outlook 

In this contribution we evaluated GPUs as a massively paral- 
lel computing architecture for the solution of time dependent 
partial differential equations by means of the Fourier split op- 
erator method. The computational complexity of the Fourier 
split operator method is dominated by the computational com- 
plexity of the fast Fourier transform. We demonstrated that 
GPUs reach much better performance in computing the fast 
Fourier transform than current CPUs. Depending on the prob- 
lem size, the performance gain may exceed one order of mag- 
nitude as compared to sequential CPU implementations. Thus, 
the Fourier split operator method may be implemented very 
efficiently on GPU architectures as we demonstrated for the 
time dependent Schrtidinger equation and the time dependent 
Dirac equation. The combination of a highly parallel architec- 
ture with a high-throughput memory makes graphics process- 
ing units a very attractive architecture for implementing the 
Fourier split operator method. Best performance is attained if 
all steps of the Fourier split operator method are carried out 
by the GPU avoiding data transfer between host memory and 
GPU memory. 

The fast Fourier transform is a core building block for 
the solution of partial differential equations as well as of 
many other problems from computational physics, signal pro- 
cessing, tomography, computational finance and other fields. 
Thus, we expect that also these problems may be solved much 
more efficiently on GPU architectures than on conventional 
CPUs. Taking into account that GPU computing is a rather 
young field it is supposed that there is still plenty potential for 
GPU technology and codes to mature further and to find new 
applications. 



wave lengths has become feasible by our GPU implementa- 
tion. 
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